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Abstract 

We study the problem of high-dimensional regression when there may be interacting 
variables. We introduce a new idea, which we call Backtracking, that can be incorporated 
into many existing high-dimensional methods that only fit additive models. It works 
by iteratively building up an increasing set of candidate interactions, which, along with 
the main effects, it makes available for selection by the base regression procedure. Our 
method is computationally fast, and makes use of parallel processing. In the case of 
£i-penalised least squares (Lasso), we give some theoretical support for our procedure. 
The effectiveness of our method when applied to regression and classification problems is 
demonstrated on simulated and real data sets. 

Key words: Backtracking, high-dimensional data, interactions, Lasso, parallel computing. 

1 Introduction 

In recent years, much progress has been made in the field of high-dimensional regression. 
We now have a powerful array of methods, many of which are supported by rich theoreti- 
cal evidence, are computationally tractable, and have the ability to work well in the high- 



dimensional setting (see Biihlmann and van de Geer (2011) for a book-length overview, and 
references therein). Many familiar models from classical (low-dimensional) statistics can now 
be fitted in situations where the number of variables p, greatly exceeds the number of ob- 
servations n. Examples include generalised linear models (van de Geer, 2008) and additive 



models (Ravikumar et al, 2009; Meier, van de Geer and Biihlmann 2009), to name but two. 



These are often fitted by minimising an objective consisting of the sum of an empirical risk 
term and a sparsity-inducing penalty term. Thus they perform simultaneous model selection 
and parameter estimation. 
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Typically, a regularisation or tuning parameter, A, controls the relative contribution of the 
penalty term to the objective. Rather than fixing A, we usually compute a path of minimisers 
as A varies. One can then select a final A from this solution path using cross-validation, for 
example. 

Despite the advances, there has been comparatively little development on fitting models 
which include interactions. The barrier seems to be the way the effective number of variables 
increases dramatically, even when only first-order interactions are included. Since there are 
p(p — l)/2 possible first-order interactions, when p is itself of the order of several thousands, 
as is common in many applications, this presents a real computational challenge. From a 
statistical point of view, there are also serious problems. The main effects can be swamped 
by the vastly more numerous interaction terms and without proper regularisation, stand little 



chance of being selected in the final model (see Figure lb ). It is possible to tackle this problem 
by enforcing that an interaction term can only be in the model when the corresponding main 
effects are also present. Such a condition can be imposed using cleverly constructed penalty 



functions (Zhao, Rocha and Yu, 2009; Radchenko and James, 2010), but only at the expense 
of increasing the computational burden. When higher-order interactions are considered, the 
difficulties become even more severe. 

In this paper, we propose a new method, which we call Backtracking, for fitting models 
with interactions to high-dimensional data. Backtracking is not a new estimator: rather it is a 
method that can be incorporated into many procedures which produce a path of solutions. It 
works by iteratively building up an increasing set of candidate interactions, which, along with 
the main effects, it makes available for selection by the regression procedure. The final result is 
a collection of solution paths Pi, . . . , Pr, each with a different, data-driven choice of candidate 
interactions present in the corresponding design matrix. This family of solution paths also 
includes the solution path with no interactions, in a very natural way. Thus, provided we are 
able to pick the best solution from this family paths, our method can do no worse than ignoring 
all interactions (which is what is often done in practice in high-dimensional situations), but 
could quite possibly be better when interactions are present. 

Remarkably, even when second or higher order interactions are considered, computation 
of the full collection of solution paths is very fast. This is for three reasons. Firstly, we do not 
consider all possible sets of candidate interactions: a task which would have combinatorial 
complexity in any case. Instead, our sets of candidate interactions are constructed in a 
hierarchical manner. Secondly, rather than computing each of the solution paths from scratch, 
for each new solution path Pk+i, we first track along the previous path P^ to find where Pk+i 
departs from P^. This is the origin of the name Backtracking. Typically, checking whether 
a given trial solution is on a solution path requires much less computation than calculating 
the solution path itself, and so this Backtracking step can be made very fast. Thirdly, when 
the solution paths do separate, the tail portions of the paths can be computed in parallel. In 
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addition to the methodological contribution of our paper, for the case of using Backtracking 
with the Lasso, we are able to give sufficient conditions for our algorithm to correctly identify 
all non-zero terms in the linear model with interactions. Our result is finite sample in nature, 
and allows for p> n. 

The rest of the paper is organised as follows. In Section [2] we describe an example which 
provides some motivation for our Backtracking method. In Section [3] we develop our method 
in the context of the Lasso for the linear model. We build up to our final algorithm in stages, 
presenting the final version in Section 3.2.2 In Section |4j we present some theoretical results 
which aim to give a deeper understanding of the way in which Backtracking works. Proofs 
are collected in the appendix. In Section [5j we describe how our method can be extended 
beyond the case of the Lasso for the linear model. Finally, in Section [6] we present simulation 
results and analyses of three publicly available datasets that demonstrate the effectiveness of 
Backtracking. 



Some work related to our method here is the MARS procedure of Friedman ( 1991 ) and the 



proposal of Bickel et al. (2010). In Friedman (1991), an increasingly complex model is built 
up iteratively, in a similar fashion to Backtracking. However, the method does this in a greedy 
fashion whereas here we consider incorporating a similar model building strategy into methods 



based on convex optimisation. Bickel et al. (2010) propose a procedure involving sequential 
Lasso fits which, for some predefined number K, selects K variables from each fit and all 
interactions between those variables are added as candidate variables for the following fit. 
Our method works differently by including interaction terms as soon as their corresponding 
main effects are selected, thereby obviating the need to choose K, and allowing for the use of 



our fast algorithm (Section 3.2.2) to produce entire solution paths for the parameters. 

Work that considers fitting models with interactions in lower- dimensional situations than 



we have in mind here includes Lin and Zhang (2006), Yuan et al. (2009), Zhao, Rocha and 



Yu (2009) and Radchenko and James (2010). Applications of Backtracking to the methods 



in the latter two of these are discussed in Section [H 



2 Motivation 

Consider the linear model: 

Y = fi°l + X(3° + e; (2.1) 

where /z° is an intercept term; 1 is an n- vector of l's; X is an n X p design matrix; /3° G W is 
a vector of coefficients; and Y is an n-vector of responses corrupted by noise e, which we take 
to have mean 0. We may assume here that the columns of X are centred and scaled so that 
(X T X)jj /n = 1. One of the most popular methods for estimating j3° in the high-dimensional 
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setting is the Lasso (Tibshirani 



1996). This satisfies 



(A, 0) = argmin \\Y -/si- X/3\\ 2 2 + A H/3HJ. (2.2) 



Now suppose we have data that we are willing to assume follows the linear model (2.1 ), but 
only when the design matrix X has been augmented with some first-order interaction terms. 
A priori we do not know which interaction terms should be included. Suppose further that 
we believe interactions are only present in the true model when main effects are. Given the 
problems with including all possible interactions in design matrix discussed in the previous 
section, one might consider taking the following steps: 

1. Select a model from the Lasso applied to the original design matrix X; 

2. Add to X all interactions between the variables selected in the previous step; 

3. Re-run the Lasso on the augmented design matrix. 

However, even with this method, one can run into problems, as the following example illus- 
trates. 

Here, we will use X^ to denote the j th column of X, which we then regard as a vector. 
We index observations by i. Consider first (Z- \z± ) generated from a mean zero 

multivariate normal distribution with Vav(Z^) = 1, j = 1,2,3, Co\(z\ l \ Z^) = and 
Cov(Z^\ Z^) = Cov(Z^ 2 \ Z^) = 1/2. Independently generate R^' and each of which 
takes only the values {—1, 1}, each with probability 1/2. We form the i til row of the design 
matrix as follows: 



f } )i4 2) r /4 , 





d(1) 
=R} 'sgn 


xf> 


=rM\z? 


x°» 


=R} 'sgn 


x<? 




xf> 


=*<»>. 



(i) 

The remaining A- , j = 6, . . . ,p are independently generated from a standard normal distri- 
bution. Finally, we generate the response according to 

Yi = £ PiX? + fhxM 1 * + + /? 9 Af >Af > + e t (2.3) 

where ei ~ N(0, a 2 ). When /% = — 5(^7 + Ps) the model above has the following properties: 

(i) For every interaction present in the model, the corresponding main effects are also 
present. 
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(ii) Since X^xj® = and xf^X^ = Z^ \ X? is uncorrelated with the response, and 
moreover it is uncorrelated with any linear combination of {X^ : j ^ 5}. 

Thus when p is large, if we fit the Lasso with no interactions, it is very unlikely that X^ will 
be selected. Then if we add all first-order interactions between the selected variables and fit 
the Lasso once more, the interaction between X^ and X^ will not be included. Of course, 
one can again add interactions between selected variables and compute another Lasso fit, and 
then there is a chance the interaction will be selected. Thus it is very likely that at least three 
Lasso fits will be needed in order to select the right variables. 



Figure la shows the result of applying the Lasso to data generated according to (2.3) with 
200 independent and identically distributed (i.i.d.) observations, p = 500, o chosen to give a 
signal-to-noise ratio of 4, and 

P = (-1.25, -0.75, 0.75, -0.5, -2, 1.5, 2, 2, 1) T . 

As expected, we see variable 5 is nowhere to be seen and instead many unwanted variables are 



selected as A is decreased. Figure lb illustrates the effect of including all p(p — l)/2 possible 
interactions in the design matrix. Even in our rather moderate-dimensional situation, we are 
not able to recover the true signal. Though all the true interaction terms are selected, now 
both variables 4 and 5 are not present in the solution paths and many false interactions are 
selected. 

Although this example is rather contrived, it illustrates how sometimes the right interac- 
tions need to be augmented to the design matrix in order for certain variables to be selected. 
Even when interactions are only present if the corresponding main effects are too, main effects 
can be missed by a procedure that does not consider interactions. Except purely by chance, 
the variable X^ can only be selected by the Lasso if either the interactions between X^ 
and X& or X® and X^ are present in the design matrix. We also see that multiple Lasso 
fits might be needed to have any chance of selecting the right model. 

This raises the question of which tuning parameters to use in the multiple Lasso fits. 
When applying the Lasso, we compute a solution path as the tuning parameter A decreases 
from oo. Since making sure the interactions in the true model are present as candidates for 
selection by the Lasso can be important, even for the selection of main effects, it is sensible 
to include suspected interactions in the design matrix 'as soon as possible'. That is, if we 
progress along the solution path from A = oo, and two variables enter the model, we should 
immediately add their interaction to the design matrix and start computing the Lasso again. 
We could now disregard the original path, but there is little to lose, and possibly much to 
gain, in continuing the original path in parallel with the new one. That way, if the true model 
contains no interactions, we have a solution path whose estimates are unhindered by the 
addition of spurious interactions. We can then repeat this process, adding new interactions 
when necessary, and restarting the Lasso, whilst still continuing all previous paths in parallel. 



5 



In the following section we formalise this idea. 



3 Backtracking with the Lasso 
3.1 A naive algorithm 

In this section we introduce a version of the Backtracking algorithm applied to l\ penalised 



least squares (2.2). First, we present a naive version of the algorithm, which is easy to 



understand. Later in Section 3.2, we show that this algorithm performs a large number of 
unnecessary calculations, and we give a far more efficient version. 

We begin by defining some notation. Let X be the original n x p design matrix, with no 
interactions. In order to consider interactions in our models, rather than indexing variables 
by a single number j, we use subsets of {1, ...,p}. Thus by variable {1,2}, we mean the 
interaction between variables 1 and 2, i.e. X^X^ 2 \ centred and scaled to have £2 norm ^/n, 
with the product being componentwise. By variable {1}, we simply mean the first column 
of X. Let us denote the power set operator by V . For C C V({1, ■ ■ ■ ,p}), we can form 
a modified design matrix Xq, where the columns of Xq are given by the variables in C, 
and are scaled appropriately. Thus C is the set of candidate variables available for selection 
when design matrix Xq is used. This subsetting operation will always be taken to have been 
performed before any further operations on the matrix, so in particular X^ means (Xc) T ■ 
We will consider all associated vectors and matrices as indexed by variables, so we may speak 
of component {1, 2} of j3, denoted P{i2}> ^ P was multiplying a design matrix which included 
{1,2}. Further, for any collection of variables A, we will write /3 a for the subvector whose 
components are those indexed by A. To represent an arbitrary variable, we shall use v or u 
rather than j, to remind us that variables are now sets. 



We will often need to express the dependence of the Lasso solution j3 (2.2) on the tuning 
parameter A and the design matrix used. We shall write (3(\, C) when Xc is the design 
matrix. We will denote the set of active components of a solution f3 by A(/3) = {v : j3 v 7^ 0}. 
Finally, for any A C V({1, . . . ,p}), let 

1(A) ={dC{1,...,j)}: for all u C v, u ^ 0, u G A}. 

In other words, 1(A) is the set of variables not in A, all of whose corresponding lower order 
interactions are present in A. For example, X({{1}, {2}}) = {{1, 2}}, andX({{l}, {2}. {3}}) = 
{{1,2}, {2, 3}, {1,3}}. 

As mentioned in section [TJ Backtracking relies on a path algorithm for computing, in our 
case here, the solution path of \2.2\. Several such algorithms are available: the homotopy 



method of Osborne, Presnell and Turlach (2000a) and Osborne, Presnell and Turlach (2000b), 



and the closely related LARS algorithm (Efron et al. 2004) make use of the piecewise linearity 
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0,2 0.4 0.6 0.8 1.0 1.2 1.4 

(a) Main effects only 



0.4 0.6 0.8 1.0 1.2 1.4 

(b) All interactions added 




0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.2 0.4 0.6 0.8 1.0 1.2 1.4 

(c) Step 3: {1, 2}, {2, 6}. {1, 6} added in step 2. (d) Step 4: {1, 3}, {2, 3}, {3, 6}. 




0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.2 0.4 0.6 0.8 1.0 1.2 1.4 



(e) Step 5: {1,4}, {2,4}, {3,4}, {4,6}. (f) Step 6: {1,5}, {2,5}, {3,5}, {4,5}, {5,6}. 

Figure 1: The coefficient paths against A of the Lasso with main effects only, (a); the Lasso 
with all interactions added, (b); and Backtracking with k = 3, . . . , 6, ((c)-(d)); when applied 
to the example in Section [2j Below the Backtracking solution paths we give \ C^-i- 
the interactions which have been added in the current step. The solid red, green, yellow, 
blue, cyan and magenta lines trace the coefficients of variables 1, . . . , 6 respectively, with the 
alternately coloured lines representing the corresponding interactions. The dotted blue and 
red coefficient paths indicate noise main effect and interaction terms respectively. Vertical 
green and red dotted lines give the values of A^. tart and A^ dd respectively. 
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of the solution path, and are able to compute it exactly. More recently, coordinate descent 
methods have been demonstrated to be significantly faster in high-dimensional situations, and 



can be applied to fit generalised linear models and others (Friedman, Hastie and Tibshirani 



2010). These, however, compute the solution path at a discrete set of A values. Any of 



these algorithms are suitable for use in conjunction with Backtracking, but we will focus our 
discussion on the coordinate descent method, because of the advantages already mentioned. 

We are now in a position to introduce the naive version of our Backtracking algorithm 
applied to the Lasso. We will assume that Y is centered, so no intercept term is necessary. Let 
Ai > • • • > Xl be the grid of A values at which we are computing the solutions. In Algorithm 
[l] below, B is a function whose argument is the index of grid values. It simply stores the Lasso 
solutions for the different A values. We have included three lines of pseudocode which are 
computationally extraneous in the current situation, but will be helpful for the explanation of 



the modifications to be presented in Section 3.2. These are: line 6, which stores the residual 
vector in R; line 7, which stores the maximum value of k at which the Lasso solution has 
been computed; and line 12, which sets l^ dd to be the grid index at which we add to the set 
of candidates Ck- 

Line 13 needs some further explanation. Here a new process is spawned, which we denote 
Jobfc, whose job it is to continue the solution path Pk, using coordinate descent for example. 
In lines 19-20, the output from each of these child processes is collected together. Note that 
we do not require that we have T — 1 extra CPUs available, since once each CPU has finished 
a job, it can move on to the next one. 

Looking at the nested while loop, we see that given a set of candidates Ck, the algorithm 
decrements A until either the active set can support interaction terms currently not in Ck, i.e. 
until I ^ Ck, or until a perfect fit or A^ is reached. If it is the case that interaction terms I 
can be added to Ck, a process is spawned to continue the current solution path, an updated 
set of candidates, Ck+i is formed, and A is decreased from Ai once more. Termination of the 
algorithm is guaranteed since \Ck\ increases at each iteration, and it cannot exceed 2 P — 1. 
Though in practice, termination will typically occur long before Ck = V({1, . . . ,p}) \ {0}, 
for both computational and statistical reasons, we recommend terminating the algorithm if 



\Ck\ —p becomes too large (see Section 3.1.1). We note also that the possible interactions to 
consider can easily be restricted to, say, first-order interactions. 

The final output of the algorithm is a collection of solution paths, each one of which 
corresponds to a different set of candidates. Figures lc - If] show steps 3-6 (i.e. k = 3, . . . ,6) 
of Backtracking applied to the example described in Section [2] Note that Figure la is in fact 
step 1. Step 2 is not shown as the plot looks identical to that in Figure la We see that when 
k = 6, we have a solution path where all the true variable and interaction terms are active 
before any noise variables enter the coefficient plots. 

In the following section we explain how one can choose a final estimator from this collection 



S 



Algorithm 1 A naive version of Backtracking for the Lasso. 



l: Ci <- {{!}, . . . , M}; fc <- 1; Z <- 1; 5(0) <- 0; 7 <- 0; i2(l) <- F 



2 


loop 


3 


while 7 C C k and J < L and P(Z) / do 


4 


B(l) <- fcXuCk) 


5 


I^l(A(B(l))) 


6 


R(l)<-Y-X Ck B Ck (l) 


7 


K{1) <- fc 


8 


Z <- Z + 1 


9 


end while 


10 


P fc <- (fl(Z') : Z' < Z) 


11 


2 add ^_ / _ 1 


12 


if 7 g C fc then 


13 


Spawn Jobfc «— ContinuePath(Pfc, Cfc) 


14 


Cfc+i «— Cfe U 7 


15 


Z <- 1 


16 


fc <- fc + 1 


17 


else 


18 


T <- k 


19 


for k = 1 to T - 1 do 


Ensure: Job^ has finished 


20 


Pfc «— Output (Jobfc) 


21 


end for 


22 


return (Pi,...,Pr) 


23 


end if 


24 


end loop 



9 



of paths. 



3.1.1 Cross-validation 

Where the Lasso has one tuning parameter, with Backtracking we have two: A and k, the rank 
of the path. When using the Lasso, the tuning parameter used to construct the final estimator 
is typically chosen by cross-validation. Since the number of paths T can vary for collections 
of paths calculated on different folds, cross-validation cannot be applied immediately. This 
minor difficulty can easily be overcome if we agree that Pr+m '■= Pt for all m£N, and that 
if two sets of tuning parameters give the same cross-validation score, we prefer the one with 
lower k. 

In many cases we may be performing Backtracking and forcing early termination if Ct 
gets too large. If the (A, k) pair with minimal cross-validation score has k less than each of 
the maximum number of steps reached on each of the folds, one can think of this as a local 
minimiser of the cross-validation score when the size of C& is unrestricted. Often, this may in 
fact be the global minimiser, and in these cases calculating the full collection of solution paths 
without early termination would result in unnecessary computation. Even when this is not 
true, since one expects the variance of Xc k (3(\,Ck) to increase with k, there are statistical 
reasons one might prefer the restricted minimiser. 

In fact, the same reasoning supports terminating solution paths when the active set gets 
large and so selecting A to be a possibly local minimiser of the cross-validation score. Since 
the bulk of the computation in the Lasso solution path occurs when the active set is large, 
this can result in big computational savings. 

In many situations, rather than using the final estimator from the Lasso, it is often better 
to use the active sets from the Lasso solution paths, and apply a further estimation procedure 
to subsets of the original design matrix whose columns are given by the variables in the active 
sets. Sensible candidates for this second estimation procedure are ordinary least squares and a 
further Lasso fit; these choices giving methods known as (a variant of) the LARS-OLS hybrid 
(Efron et al, 2004) and the relaxed Lasso ( Meinshausen| [2007 ) respectively. An alternative 



to this approach is the adaptive Lasso of Zou (2006). All of these methods can be used 



in conjunction with Backtracking and for our numerical results in Section 6.1 we use the 
LARS-OLS hybrid. 



3.2 Speeding up computation 
3.2.1 An improved algorithm 

The process of performing multiple Lasso fits is computationally cumbersome, and an imme- 
diate gain in efficiency can be realised by noticing that the final collection of solution paths 
is in fact a tree of solutions: many of the solution paths computed will share the same initial 
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portions. To discuss this, we first recall that by considering subgradients or simply one-sided 
directional derivatives, (3 is a solution to (2.2) when the design matrix is Xq if and only if 



(Y - X C P) = Xsgn(p v ) forA^O 



r 



4 J 



< A for /3 V 



(3.1) 
(3.2) 



Note the X c fil term vanishes as the columns of Xq are assumed to be centred. These are 
often referred to as the KKT conditions for the Lasso in the literature. 



Write \f A for A /add and set A : 



start 
k+1 



such that the following holds for all A > A^f^*: 



A^atart to be the minimal element of {Ai, . . . , A^ } 



X, 



c k+1 \c K (Y-XcJ(\,C k )) 



< A. 



Then crucially, for all A > A 



start 
k+1 > 



(3.3) 



Pc k+1 \C k (X,C k+1 ) = and 

Note the existence of AjSJ is guaranteed provided Ai is sufficiently large, since /3(A, C k +i) = 
and /3(A, C k ) = for A sufficiently large. The modifications needed when Ai is not large enough 
are trivial and we do not discuss them here. We can use this knowledge to replace line 15 
in Algorithm [TJ which sets I to 1 after the set of candidates has changed, with Algorithm [2] 
below. 

Algorithm 2 An improvement on line 15 of Algorithm [T] 



V <- l 



Xl 



while 

I'i-l' + l 
end while 

jstart <_ J/ _ ! 
I <- lf?$ + 1 



< n\ v and V < lf d do 



Notice that the condition to be checked in the while loop involves the multiplication of 
a \C k+ \ \C k \ x n matrix by a vector of length re, and thus has computational complexity 
0(\Ck+i \ Ck\n). This computation is very fast, especially compared to the alternative of 
calculating /5(Aj, C k +i)- Furthermore, the while loop can, if necessary, be executed in parallel, 
making the 'Backtracking' step very fast indeed. However, since parallel computing power 
may well need to be reserved for processing the various jobs assigned to them, in the next 
section we present another version of the Backtracking algorithm that allows us to bypass 
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most of the calculations in the while loop. 



3.2.2 The final algorithm 

When the current set of candidates changes from Ck to Ck+i, Algorithm [2] searches along 
Pk from Ai to A^ dd , checking the validity of each point on the path as a solution /3(A, Ck+i) 
with the enlarged candidate set. If the search reaches Aj^ dd , we would have done a fair few 
calculations simply to end up, quite literally, back where we started. Motivated by this 
observation, in Algorithm [3] we present a further improvement on line 15 of Algorithm [TJ 



Algorithm 3 A further improvement on line 15 of Algorithm [T] 



if 



X C k+1 \C k R ( l k 



add" 



< nA^ dd then 



/add 



/start 
l fe+l 

else 

Set to be any I' < lf d such that 



end if 

7 St 



< nXy and 



I <- V&f + 1 



In other words, we first check whether Pk+i can simply be made to extend Pk- If not, 
we search for any point where Pk and Pk+i agree but after which they disagree, rather than 
the first such point. Such a search can be implemented by a bisection method which would 
terminate in at most 0(log 2 ^ dd ) steps. Since Z^ dd < L and L would not usually be more 
than a few hundred, this modified search is very cheap. 

A possible disadvantage of this approach is that the solution paths computed will only 
be approximate. If we let /3(A,C), defined for C = C%, . . . ,Ct, give the solution paths 
obtained by Backtracking with bisection search, then we only know that for A < Aj£?j t we 
have /3(A, Cfc+i) = /3(A, Cfc+i), the true solution. For A > A^fJ, this need not be the case, 
as variables in Ck+i \ Ck could have entered the solution path /3(A, Ck+i) at an earlier stage, 
but then left at or before A^f^. In practice, variables leaving and re-entering the solution 
path does not happen too often. In fact, one might say that an active variable that is about 
to leave the active set should be regarded as suspicious, and it makes sense to include it only 
at a later stage along the path. Furthermore, for the theory we develop in Section [4j we lose 
nothing by using the approximate solutions. For these reasons, we prefer to use the bisection 
search method, and from now on, we will use Backtracking to mean precisely this variant. 

One computational shortcut we have not mentioned yet concerns the fact that when 
^start _ ^add^ ^ e so lution paths Pk and Pk+i will still agree beyond X S k+i an< ^ * ne solution 
tree will not branch at this point. In this case our algorithm, as it has been presented, 
will perform some unnecessary computation, though if this redundancy were removed the 
parallel computational complexity would remain the same. It is straightforward to modify 
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the algorithm so that processes are only spawned at branch points of the solution tree, but 
the details are rather technical and we do not discuss them here. 



4 Theoretical properties 

In this section, we give some sufficient conditions for the Lasso with Backtracking to arrive 
at a set of candidates, C°, that contains all of the true interactions, and only a few false 
interactions. On the event on which this occurs, we can then apply many of the existing 
results on the Lasso, to show that the solution path /3(A, C°) has certain properties. As an 
example, we give sufficient conditions for the existence of a A* such that {v : (3 V (\*,C°) ^ 0} 
equals the 'true' set of variables. Our results here are not intended for practical use but 
instead aim to give a better understanding of the way Backtracking works. 

We work with the normal linear model with interactions, 



where : ~ ' N(0, a 2 ), and to ensure identifiability, X s o has full column rank. Recall that we 
also assume that the columns of X are centred and scaled to have £2 norm ^/n. 



for the vector of regression coefficients obtained by regressing /° on the columns of Xg- Also, 
let P s = Xs(XgXs)~ 1 Xg denote orthogonal projection on to the space spanned by the 
columns of X$- For any two candidate sets S, M C V({1, ■ ■ . , p}), define 



Y = p°l + X so p° s o + e, 



(4.1) 



We write / for X s o f3g . For a set S such that Xs has full column rank, we shall write 



f3 s = (X^X s )-^f 



n 



1 




Finally, for any square matrix S, let c m i n (S) denote its minimal eigenvalue. 
The following 'entry condition' will play a key role in our theory. 
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4.1 The entry condition 

For M C C C P({1,. . . u G S := C\ M and 77 > 0, we shall say that the Ent(u, M, C; 77) 
condition holds if, Xg has full column rank, and the following holds, 



sup 

T S eRl s l : ||T S |l <i 



< 1, 



> max 



1 

n 


lW r (/-F s )/° 


+ 2r? 


1 - 




1 



+ 77 



(4.2) 



(4.3) 



In Lemma [3] given in the Appendix, we show that this condition is sufficient for variable 
v to enter the active set before any variable in M, when the set of candidates is C and 
llX^ell < 77. In addition, we show that v will remain in the active set at least until some 

II L> II co — ' ' 

variable from M enters the active set. 



The condition ( 


4.2) is closely related to irrepresentable conditions (see 


Meinshausen and 


Biihlmann 


( 


2006 


), 


Zhao and Yu (2006 


), 


Zou 


( 


2006 


), 


Biihlmann and van de Geer 


( 


2009 


), 



Wainwright ( 2009 ) , for example) , which are used for proving variable selection consistency of 
the Lasso. Indeed, when S is the set of true nonzero coefficients, it can be shown that the 
condition, 



Sm,5S 5 csgn(/?; 



0- 



< 1, 



(4.4) 



is essentially necessary for variable selection consistency of the Lasso. If we require this to 

3§) 



hold for all possible sign vectors sgn(/3<j), we arrive at (4.2). 



The second part of the entry condition (4.3) asserts that coefficient v of the regression 
of / on Xs must exceed a certain quantity that we now examine in more detail. The 
ijW (7-P s )/° term is the sample covariance between X^ u \ which is one of the columns 
of Xm, and the residual from regressing f° on Xs- 

To understand the (S^)^ term, without loss of generality take v as {1} and write 

D. Using the formula for the inverse of a block matrix and 



J S\{v},{v} 



S,SJ 

b and Ss^}^,,} 



writing s for we have 



< 1 + 



'l + b T (D-bb T )- 1 b s 
^ -(D-bb T )-*b , 

Praia (Ss,s) 



In the final line we have used the Cauchy-Schwarz inequality and the fact that if w* is a unit 
eigenvector of D — bb T with minimal eigenvalue, then 



Cmin(-t> - bb 1 



^s,s 




> c a a a (T,s ) s)\/l + \b T W*\ 2 > Cr^ n (T, S ,s)- 
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Thus when variable v is not too correlated with the other variables in S, and so 



is small, 



will not be too large. Even when this is not the case, we still have the bound 



< 

1 Cmin 



Turning now to the denominator, 



is the i\ norm of the coefficient of regres- 



sion of X^ 1 on Xs, and the maximum of this quantity over u € M gives the left-hand side of 

will 



(4.2). Thus when u is highly correlated with many of the variables in S, 



be large. On the other hand, in this case one would expect — P S )X ( - U ^ 
so to some extent the numerator and denominator compensate for each other. 



J S,S^S,{u} 



to be small, and 



4.2 Statement of results 

First we establish some notation. For i = 1,2, define Sf = {v £ S° : \v\ = i}. Thus S® is the 
set of main effects and are the first order interactions. In addition, let 

if = {v : \v\ = l,v C u, some u G 5°} 

be the set of interacting variables, and without loss of generality assume if = {{1}, . . . , {\lf |}}. 
We shall make the following assumptions: 

(Al) S° contains only main effects and first order interactions. 

(A2) 7° C Sf. 

(A3) There is some If C Sf C Sf, and some ordering of the variables in if, which without 
loss of generality we take to simply be 1, ... , \lf\, such that for each {j} 6 if, we have, 

For all A : X({{1}, 1}}) CK l(Sf) 

Ent({j}, {d UA)\ S°, Ci U A; rj) holds. 

where S° = S° n l{Sf) and 

( , ^o, v > + 21og( P +^|^ 

Assumption (A2) says that interactions are only present in S° if both corresponding main 
effects are also present. 

Assumption (A3) is more complicated, and first we discuss the implications for variable 
{1}. The condition ensures that whenever the candidate set is enlarged from C\ to also include 
any subset ofl(Sf), variable {1} enters the active set before any variable outside S°, and 
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moreover, it remains in the active set at least until a variable outside S° enters. For j > 2, 
we see that the enlarged candidate sets for which we require the entry conditions to hold, are 
fewer in number. Variable only requires the entry condition to hold for candidate sets 

that at least include X({{1}, . . . , {|/^| — 1}}) and thus include almost all of 5°. What this 
means is that we require some 'strong' interacting variables, for which when / is regressed 
onto a variety of sets of variables containing them (some of which contain only a few of the 
true interaction variables), always have large coefficients. Given the existence of such strong 
variables, other interacting variables need only have large coefficients when /° is regressed 
onto sets containing them that also include many true interaction terms. Going back to the 
example in Section 2 variable 5 has \P{$y\ ~ f° r an S C {1,...,6}, but \Pf^y\ > once 
{1,2} G S or {3,4} G S. 

The reason we use the sets 5° and S° rather than their larger counterparts, 5° and S°, 
is that there may be some very weak signals in S® \ I^. We do not want to require that 
the interacting variables remain in the active set all the way until these weak variables are 
selected, as the entry conditions would dictate. 

We are now in a position to state our main theorem. Although the Backtracking algorithm 
was presented for a base path algorithm that computed solutions at only discrete values, for 
the following theorem, we need to imagine an idealised algorithm which computes the entire 
path of solutions. Explicitly, we require that our algorithm outputs a collection of paths 

{P(\,C k ):\£ [0,oo],l<fc<T} 

for which there exists a A| tart sequence with A| tart = oo, that satisfies /3(A, C k ) = /3(A, C k ) for 
all A < A|* art , and for 2 < k < T, 

-4(/3(A| tart , C k )) = -4(/3(A s fc tart , C fc _!)). 

In addition, we will assume that we only allow first-order interactions in the Backtracking 
algorithm. 

Theorem 1. Assume (A1)-(A3) and let C° = C\ Ul(Sf). With probability at least 1 - 
exp(— 1 2 /2), there exists a k* such that C° 5 C k * 5 S° . 

Theorem [T] gives sufficient conditions for Backtracking to produce a set of candidates that 
includes S°, but no interactions among variables in C\ \ S°. Once we have such a set of 
candidates, we are essentially in the familiar 'Lasso with the linear model world', and we 
do not need to worry about interactions. The one caveat is that the path /3(-,Cfc*) need 
only coincide with /3(-,Cfc*) after A| tart . If this subtlety is taken into account, many of the 
theorems concerning the Lasso with the linear model can be applied. As an example, we give 
the following corollary. 
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Corollary 2. Assume (Al)-(A3). Writing N = C° \ S°, further assume 



and that for all v £ S° , 



where 



Sjv,5° s 50 1 )S oSgn(/3^ , 



< 1; 



1/aSl > 





sgn(/30 ) T (S- 1 s0 )W 




1 - 


S 7V,s o£ 50 1 iS oSgn(^ ) 


oo 



< ^ KO, /v U t 2 + 21og(|gQ|) 

y nc m i n (L s o i5 .o) 

Then with probability at least 1 — 3exp(— 1 2 /2), there exist k* and X* such that 

A0(\*,C k *)) = S°. 



Note that if we were to simply apply the Lasso to the set of candidates C := C% U 
X(Ci) (i.e. all possible main effects and their first-order interactions), we would require an 
irrepresentable condition of the form 



S N a11 , S° S° Sgn ( A 



' 
5°- 



< 1, 



where A ra11 = C all \ S°. Thus we would need 0(p 2 ) inequalities to hold, rather than our 0(p). 
Of course, we had to introduce many additional assumptions ((Al) - (A3)) to reach this 
stage and no set of assumptions is uniformly stronger or weaker than the other. However, our 
proposed method is computationally feasible. 



5 Further applications of Backtracking 

Backtracking was presented in the context of the Lasso for the linear model, where we were 
also able to derive some theoretical properties of the procedure. However, the real power of 
the idea is that it can be incorporated into any method that produces a path of increasingly 
complex sparse solutions by solving a family of convex optimisation problems parametrised 
by a tuning parameter. For the Backtracking step (Algorithm [3]) , the KKT conditions for 
these optimisation problems provide a way of checking whether a given trial solution is an 
optimum. As in the case of the Lasso, checking whether the KKT conditions are satisfied 
typically requires much less computational effort than computing a solution from scratch. 
Below we briefly sketch some applications of Backtracking to a few of the many possible 
methods with which it can be used. 
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5.1 Multinomial regression 



An example, which we apply to real data in Section 6.2 is multinomial regression with a 



group Lasso (Yuan and Lin, 2006) penalty. Consider n observations of a categorical response 



that takes J levels, and p associated covariates. Let Y be the indicator response matrix, with 



th 



entry equal to 1 if the i th observation takes the j' th level, and otherwise. We model 



exp + 



(X s0 f3° 



U) 



£/, =1 exp +(X S0 P^' 



Here fi° is a vector of intercept terms and f3° is a \S°\ x J matrix of coefficients. This model 
is over-parametrised, but regularisation still allows us produce estimates of /jP and /3° and 



hence also of II (see Friedman, Hastie and Tibshirani (2010)). When our design matrix is 



Xc, these estimates are given by (/x, ft) = argmin /3; A) where 



J I J 

Q(/x,/3;A) = ^Y^ifXjl+Xc^-^log ^expfal + XcP®) I +A ^ 

j=l \j=l J v&C 



The functions log and exp are to be understood as applied componentwise and the rows of j3 
are indexed by elements of C. To derive the Backtracking step for this situation, we turn to 
the KKT conditions which characterise the minima of Q: 



i(Y T -n T (/i,/3;Ac))l = 0. 
}(Y T -lP{^kx c ))X^ = - 



for(/3 T )^0, 



(Y T - n T (/i, 0; I c ))lW 2 < A for (/3 T ) W = 0. 



Thus, analogously to for D D C, (f3 T (\, D)) D \ C = and (f3 T (X,D)) c = /3 T (X,C) if 

and only if 



max ■ 

v£D\C 



(Y T -U T (fi(X,C),(3(X,Cy,Xc))X 



(v) 



< A. 



5.2 Structural sparsity 

Although in our Backtracking algorithm, interaction terms are only added as candidates for 
selection when all their lower order interactions and main effects are active, this hierarchy in 
the selection of candidates does not necessarily follow through to the final model: one can 
have first-order interactions present in the final model without one or more of their main 
effects, for example. One way to enforce the hierarchy constraint in the final model is to use 
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a base procedure which obeys the constraint itself. Examples of such base procedures are 



provided by the Composite Absolute Penalties (CAP) family (Zhao, Rocha and Yu, 2009). 



Consider the linear regression setup (2.1 ) with interactions. For simplicity we only describe 
Backtracking with first-order interactions. Let C be the candidate set and let / = C \ C\ 

be the (first-order) interaction terms in C . In order to present the penalty, we borrow some 

(r) a (r) 

notation from Combinatorics. Let C\ denote the set of r-subsets of C\. For A C C\ and 

r > 1, define 

di(A) = {v £ c[ r ^ : v C u for some u £ A} 
d u (A) = {v £ c[ r+1 ^ : v C u for some u G A} 



These are known as the lower shadow and upper shadow respectively (Bollobas 1986). 
Our objective function is Q is given by 

Q(^P) = h W Y ~ V 1 ~ x cP\\l + A ||/3c x \fli0o||i + A l|0{«}u(flUW)nJ)|| 7 + A WPiWi > 

where 7 > 1. For example, if C = {{1}, . . . , {4}, {1, 2}, {2, 3}}, then omitting the factor of A, 
the penalty terms in Q are 

|%}| + ||(/3 { 1 } ,/3 { 1, 2} ) T || 7 + ||(/3{2>, /?{1,2>, /5{2,3>)^||^ ||(/5 {3 >, /5{2,3>)^||^ |/3{1,2>I |/3{2,3>|- 

The form of this penalty forces interactions to enter the active set only after or with their 
corresponding main effects. 

The KKT conditions for this optimisation take a more complicated form than those for the 
Lasso. Nevertheless, checking they hold for a trial solution is an easier task than computing 
a solution. 



5.3 Nonlinear models 



If a high-dimensional additive modelling method (Ravikumar et al, 2009; Meier, van de Geer 



and Biihlmann, 2009) is used as the base procedure, it is possible to fit nonlinear models with 



interactions. Here each variable is a collection of basis functions, and to add an interaction 
between variables, one adds the tensor product of the two collections of basis functions, 
penalising the new interaction basis functions appropriately. Structural sparsity approaches 



can also be used here. The VANISH method of Radchenko and James (2010 ) uses a CAP-type 



penalty in nonlinear regression, and this can be used as a base procedure in a similar way to 
that sketched above. 
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5.4 Introducing more candidates 



In our description of the Backtracking algorithm, we only introduce an interaction term when 
all of its lower order interactions and main effects are active. Another possibility, in the 



spirit of MARS (Friedman, 1991), is to add interaction terms when any of their lower order 
interactions or main effects are active. As at the k th step of Backtracking, there will be 
roughly kp extra candidates, an approach that can enforce the hierarchical constraint may be 
necessary to allow main effects to be selected from amongst the more numerous interaction 
candidates. The key point to note is that if the algorithm is terminated after T steps, we are 
having to deal with roughly at most Tp variables rather than 0(p 2 ), the latter coming from 
including all first-order interactions. 



6 Numerical results 
6.1 Simulations 

In this section, we report the results of five numerical studies designed to demonstrate the 
effectiveness of Backtracking with the Lasso and also highlight some of the drawbacks of using 
the Lasso with main effects only, when interactions are present. In each of the five scenarios, 
we generated 200 design matrices with n = 250 observations and p = 5000 covariates. The 
rows of the design matrices were sampled independently from N p (0, S) distributions. The 
covariance matrix X was chosen to be the identity in all scenarios except scenario 2, where 

5^ = o.75H^I- p / 2 l- p / 2 . 

Thus in this case, the correlation between the components decays exponentially with the 
distance between them in 7Ljp7L. 

We created the responses according to the linear model with interactions and set the 
intercept to 0: 

Y = X s0 p% + e, €i li ~ N(0, a 2 ). (6.1) 

The error variance cr 2 was chosen to achieve a signal-to- noise ratio (SNR) of 2.5, which we 
define here by 

SNR " Ejjijp ■ 

The set of main effects in S°, S®, was {{1},...,{6}}. The set of first-order interactions in S° 
chosen in the different scenarios, 5°, is displayed in Table [TJ and we took S° = S® U 5° so S° 
contained no higher order interactions. In each simulation run, /3^o was fixed and given by 

(2, -1.5, 1.25, -1,1, -if. 
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Scenario S° 

1 

2 

3 {{1,2}, {3,4}, {5,6}} 

4 {{1,2}, {1,3},..., {1,6}} 



5 I({{1},{2} ) {3}})UI({{4},{5},{6}}) 



Table 1: Simulation settings 



Thus 



0.75 



Each component of /5^ was chosen to be 0.75 
the signal strength of the interactions was 3/4 that of the main effects. 

In all of the scenarios, we applied Backtracking with the Lasso, and the Lasso with main ef- 
fects only. We constrained Backtracking to include only first-order interactions. Additionally, 
in scenarios 3-5, we applied the Lasso with all main effects and only the true interactions. This 
theoretical Oracle approach provided a gold standard against which to test the performance 
of Backtracking. For each of these procedures, we used grids of 100 A values. The LARS-OLS 
hybrid and cross-validation with squared error loss were used to give the final estimator. For 
our cross-validation procedure we randomly selected 5 folds each time but repeated this a 
total of 5 times to reduce the variance of the cross-validation scores. Thus for each A value we 
obtained an estimate of the expected prediction error that was an average over the observed 
prediction errors on 25 (overlapping) validation sets of size n/5 = 50. To further reduce the 
variance of the cross-validation scores and minimise the computational burden, both the size 
of the active set restricted to 50 and the size of restricted to p + 50 x 49/2 = 1225 (see 
Section 3.1.1). Our restricted minimisers of the cross-validation scores were always very far 
from these boundaries so it is likely they coincided with the global minimisers. 

In scenario 1, the results of Backtracking and using the main effects only were almost 
identical: Backtracking only incorrectly selected an interaction term in 2 out of the 200 
simulation runs. Both methods were able to correctly identify the signal variables in all but 
a few cases. The high correlations between signal variables and between signal and noise 
variables in scenario 2 made it a much more challenging situation, and both procedures 
struggled to select more than two main effects. Despite this, interaction terms were selected 
in only 3 of the simulation runs and the results for the two methods were again essentially 
indistinguishable. We see that when the true data generating process contains no interaction 
terms, the presence of the 1 path in Backtracking means we see only minimal deterioration 
in performance over using the main effects only. 

The results of scenarios 3-5, where the signal contains interactions, are given in Table [2j 
In the first panel, we give boxplots of the expected squared distance of the signal /° and 
our prediction functions / basted on training data (Itraim -Strain); evaluated at a random 
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independent test observation x new : 

^^now,ytrain,^train(/ ( x now) — / (^new ', ^rain j -Strain ) ) 



We also report variable selection performances of the main effects only, Backtracking and 
the Oracle procedures. Backtracking does almost as well as the Oracle procedure in each of 
the three scenarios, though with the large number of interactions in scenario 5, there is a 
slight difference in performance. Using the main effects only of course results in no interac- 
tions selected, and a much higher expected MSE. However, we also observe the phenomenon 
described in Section [2] important main effects are often missed. Note that the higher number 
of false selections incurred by both Backtracking and the Oracle procedure compared to using 
the main effects only, is due to the model selection criterion being the expected prediction 
error. It should not be taken as an indication that the latter procedure is performing well. 

6.2 Real data analyses 

In this section, we compare the performances of Backtracking with two base procedures, and 
simply using the base procedures themselves, on three real data sets. The base procedures we 
consider are the Lasso for the linear model and the Lasso for multinomial regression. Below 
we describe the data sets used. 



6.2.1 Melting points 

Quantitative Structure Activity / Property Relationship (QSAR / QSPR) studies are a com- 
mon and successful research approach to many problems in the biological and chemical sci- 
ences. The broad goal is to understand physical, chemical and biological properties of com- 
pounds in terms of their molecular structure. In order to do this, many molecular descriptors 
are generated for compounds for which a property has been observed, and then a regres- 
sion is performed on this high-dimensional data. The example we study here comes from 



Karthikeyan, Glen and Bender (2005). There are n = 4173 compounds for which the melting 



points are measured, and for each of these, p = 202 predictors have been generated based on 



the structure of the molecules. The dataset is available at |http : //pubs . acs . org 



6.2.2 Communities and Crime 



This data set from the UCI machine learning repository Frank and Asuncion (2010) available 



at http: //archive . ics .uci . edu/ml/datasets/Communities+and+Crime+Unnormalized con- 
tains crime statistics for 1995 obtained from FBI data, and national census data from 1990, 
for various towns and communities around the USA. We took violent crimes per capita as 
our response: violent crime being defined as murder, rape, robbery, or assault. The data 
set contains two different estimates of the populations of the communities: those from the 
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1990 census and those from the FBI database in 1995. The latter was used to calculate our 
desired response using the number of cases of violent crimes. However, in several cases, the 
FBI population data seemed suspect and we discarded all observations where the maximum 
of the ratios of the two available population estimates differed by more than 1.25. In addition, 
we removed all observations that were missing a response and several variables for which the 
majority of values were missing. This resulted in a dataset with n = 1903 observations and 
p = 101 covariates. 



6.2.3 ISOLET 

This data set consists of p = 617 features based on the speech waveforms generated from 
utterances of each letter of the English alphabet. The task is to learn a classifier which 
can determine the letter spoken based on these features. The dataset is available from the 
UCI machine learning repository at http://archive.ics.uci.edu/ml/datasets/ISOLET; 



sec 



Fanty and Cole (1991) for more background on the data. We consider classification on 
the notoriously challenging E-set consisting of the letters 'B', 'C, 'D', 'E', 'G', 'P', 'T', 'V 
and 'Z' (pronounced 'zee'). As there were 150 subjects and each spoke each letter twice, we 
have n = 2700 observations spread equally among 9 classes. 



6.3 Methods and results 

For the Melting points and Communities and crime data sets, we used the Lasso for the 
linear model as the base regression procedure for Backtracking. The response in each of these 
cases was scaled to have variance 1. Since the per capita violent crime response was always 
non-negative, the positive part of the fitted values was taken. We employed 5-fold cross- 
validation with squared error loss to select tuning parameters for each of the methods. For 



the classification example, penalised multinomial regression was used (see Section 5.1) and 
the deviance was used as the loss function for 5-fold cross-validation. In all of the examples, 
we only included first-order interactions and also restricted the size of Ck to p + 50 x 49/2 = 
p+ 1225. 

To evaluate the procedures, we randomly selected 2/3 for training and the remaining 1/3 
was used for testing. This was repeated 200 times for each of the data sets. Note that we 
have specifically chosen data sets with n large as well as p large. This is to ensure that 
comparisons between the performances of the methods can be made with more accuracy. For 
the regression examples, out-of-sample squared prediction error was used as a measure of 
error; for the classification example, we used out-of-sample misclassification error with 0-1 
loss. The results are given in Table [3| 

For the Melting points data set, we see that Backtracking offers a modest but nevertheless 
consistent improvement over simply using main effects. Although here it appears simply using 
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Error 


% of test sets where 


Dataset 


Main effects 


only Backtracking 


Backtracking performs better 


Melting points 


0.506 


0.493 


90 


Communities and crime 


0.403 


0.364 


93 


ISOLET 


0.0641 


0.0563 


90 



Table 3: Real data analysis results 



main effects offers good performance, Backtracking is nevertheless able to pick out subtle 
interaction terms whose inclusion in a model can make improvements in prediction error. 

With the Communities and Crime and ISOLET data sets, Backtracking significantly out- 
performs simply using the main effecst. For the Communities and crime data, we had initially 
used ^-penalised Poisson regression with log of the populations as an offset and the number of 
violent crimes as the response. However, Backtracking performed very poorly, and including 
only the main effects was much worse. A possible reason for this could be unobserved hetero- 
geneity between the different populations in the communities, which would suggest fitting a 
mixed effects model. Thus the linear model approach we have taken here may not be ideal, 
but we see that the richer class of models provided Backtracking is better able to cope with 
the potential model misspecification. Similarly we see that for the ISOLET data, the more 
complex decision boundaries permitted by Backtracking result in a large relative improvement 
in prediction error compared to using penalised multinomial regression with only the main 
effects. 
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7 Appendix 



In this section, after presenting a lemma on the entry condition (Section 4.1), we prove 
Theorem [T] and Corollary [2] The proofs of Lemma [3] below, and Corollary [2] use many ideas 
from Wainwright ( 2009[ ) and |Buhlmann and van de Geer ( 2009 ) . 



Lemma 3. Let S C C be such that X$ has full column rank and let M = C \ S. On the 

event 

the following hold: 
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(i) V 



A > max 



n 1 


(I ~ P S )f° 


+ 2 V 


1 - 




i 



then the Lasso solution is unique and /3m(A, C) = 0. 



(7.1) 



(ii) If \ is such that for some Lasso solution /3m(A, C) = 0, and for v £ S, 

|^|>||(S^)| i (A + ^, 

then for all Lasso solutions, f3 v (\, C) ^ 0. 



(Hi) Let 



A en = sup{A : A > and for some Lasso solution /3jy/(A, C) / 0}, 



where we take sup0 = 0. If for v G S, 



15,. I > max 

u&M 



MX^ T {I-P s )f\+2n 



1 



+ 7/ 



there exists a A > A cnt such that the solution /3(A, C) is unique, and for all A' G (A , A] 
and all Lasso solutions /3(A', C), we have (3 V (\' , C) / 0. 

Proof. We begin by proving (i). Suppressing the dependence of f3 on A and C, we can write 



the KKT conditions «|3l]), (Q) as 

1 



7) 



X%(Y-X c P) = \f, 



where f is an element of the subdifferential d 



P 



and thus satisfies 



IItIL < i, 

$ v ^ =>• f„ = sgn(/3„) 



(7.2) 
(7.3) 



By decomposing Y as P s f° + (I-P s )f°+e, X c as (X 5 and noting that X S {I-P S ) = 0, 
we can rewrite the KKT conditions in the following way: 



}X T s {P s f - X s Ps) + ±X$e - Z s>M p M = Xr s , 
±Xl(P S f - X s Ps) + lX T M {(I - P s )f° + e} - S m ,m/3m = Af M . 



(7.4) 
(7.5) 
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Now let f3s be a solution to the restricted Lasso problem, 

(/!,&) = argmin \\Y - fil - XsPsf + A \\0 S \\i} ■ 



The KKT conditions give that 0s satisfies 

1 



n 



X 1 s (Y-X s /3 s ) = Xt s , 



(7.6) 



where fs £ d 



0s 



We now claim that 



(0s, 0m) = (0s, 0) 
(fs, r M ) = (r s , XMfVgUfs ~ - n ^ l X T s e) + ^Xj^I - P s )f + e} 



(7.7) 
(7.8) 



is the unique solution to (7.4), (7.5), (7.2) and (7.3). Indeed, as 0s solves the reduced Lasso 
problem, we must have that (7.4) and (7.3) are satisfied. Multiplying (7.4) by Xs£^, setting 



0M = and rearranging gives us that 



P s f° - X s 0s = X s ^J\ts - lX T s e), 



(7.9) 



and substituting this into (7.5) shows that our choice of t~m satisfies (7.5). It remains to check 



that we have ||tm < 1. In fact, we shall show that Hf&fH^ < 1. Since we are on tie and 
H^slloo — 1) f° r u € M we have 



A|f„| < 

< A 

< A 



S,S^S,{u} 



(A Moo+||^IL) + -n X^ T (I~P S )f 



+ 



+ 



X^ T (I-P s )f° 



+ 2r] 



where the final inequality follows from (7.1). We have shown that there exists a solution, (3, 



to the Lasso optimisation problem with /3m = 0. The uniqueness of this solution follows from 



noting that ||Tjf || < 1, X$ has full column rank and appealing to Lemma 1 of Wainwright 



(2009). 



For (ii), note that from (7.4), provided 0m = 0, we have that 

0s = S -^S\(\T S - 1 -X T s e). 



But by assumption 



\0v\ > 



(E^)W (A + 77) > 



-1 \( v ) A 



J s,s) 



(A% 



IxU 
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whence j3 v ^ 0. 

(iii) follows easily from (i) and (ii). □ 
Proof of Theorem [T] 

In all that follows, we work on the event f2i n Og v where fl^o v is defined as in Lemma |3j 
and 

ni = {Y = h1 + X a Pa, some fi,P,Ae V{d) : \A\ < \S°\}. 
Clearly as \S°\ < n, P(fix) = 1. Note that trivially, 

A-.ACC 

so Lemma [3] can be used in conjunction with (A3) to deduce that certain variables enter the 
active set before certain sets of noise variables. Using standard bounds for the tails of Gaussian 
random variables and the union bound, it is easy to show that P(Oi nfi^o „) > 1 — exp(— 1 2 /2) 
for t > 1. 

Let Cfcmax be the largest member of {C\, . . . ,Ct} satisfying C^max C C°. Such a Cfcnu™ 
exists since C\ C C°. 

Now suppose that for k < fc max , C k ^ <5°. We shall show that then k + 1 < T and 
Cfc+i C C°, thus showing that we may take k* = k max . Take j max such that 

i({{i},...,{r-i}})cft, 

with j max maximal. Since = 0, such a j max exists. Let A = C k \ C x . Note that 

{j max - 1}}) C A C C° \ d = 
Thus by (A3), we know that for all j < j max ; the Ent({j}, C& \S°, C k , rj) condition holds. Let 
A ent = sup{A : A > and for some Lasso solution vgo (A, C&) 7^ 0}, 

where we take the supremum to be if the set is empty. By Lemma |3j part (iii), we know 
that for all j < j max , there exists a Xj > A ent such that {j} £ A0(X,C h )) for all A £ 
(A ent ,Aj], and moreover, we know that the Lasso solution at Xj is unique. Note that as 
A(/3(Xj,Ck) C 5°, the fact that we are on fii means we do not have a perfect fit at Xj, i.e. 
Y-,j,l-XcJ(Xj,C k ) 



> 0. Let A al1 = min,- A,-. Then 

2 



A(/3(A all ,C fc ))D{{l},...,{j max }} and 
/3 CfeV? o(A,C fc ) = forallA>A a11 
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That is, A is a point on the solution path at which variables {1}, . . . , {j max } are in the active 
set, and before which no variable from Cf. \ S° is active. 



Now it remains to understand what this means for the approximate solution paths, j3, 
computed by Backtracking. For the case k = 1, we have /?(•, Ci) = /3(-,Ci), and so we can 
conclude that k + 1 = 2 < T, and C 2 C C°. 

For the case k > 1, suppose first (for contradiction) that A| tart < A al1 . Note that 

A0(\t*\ C k )) = A0(Xt art , C fc _i)). 

Now we must have that 

^/5(^fc tart > Cfc-i)) C §°, (7.10) 

-4(/3(A| tart , C fc _0) ^ {{1}, . . . , {j max }} (7.11) 

as otherwise, by the design of our Backtracking algorithm, either C/% ^ C° or 5 X({{1}, . . . , {j max }}). 
By LemmajiJ part (ii), we know that for each j < j' max , if for some A < Xj, i3 Ck \§a(X, Ck) = 0, 



then (3{j}{X, Ck) ^ 0. But since A| < A , by (7.10) A| is such a A, which then contradicts 



(7.11). 



Thus A| tart > A al1 , so we can conclude that k + 1 < T and that C k +i Q&. □ 



Proof of Corollary [2] 

Let Q\ and ^ be defined as in Lemma |3j Also define the events 

Q 2 = { 1 -\\X^(I-P s °)e < V }, 



V- 1 Y T e 



In all that follows, we work on the event f?i D f^2 n D fl^ „• As I — P s ° is a projection, 

F{^\X^ T (I - P s °)e\ <i])> F(±\X^ T e\ < 77). 
Further, 50 Xj e ~ iVi 501(0, a 2 Hgo so ). Thus 

P(0 3 ) > |5°|P(|Z| < 



where Z ~ N(0, <7 2 /(nc m i n (£so 50))). Note that 



P(Oi nfi 2 no 3 n ty*^) > 1 - p(% ,„) 



p(ft c 2 ) - p(ng) 
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Usign this, it is straightforward to show that P(f2i n SI2 H fl Og, ) > 1 — 3 exp(— 1 2 /2), and 
this is valid even when t < 1. 

Since we are on n fi^-o ^, we can assume the existence of a k* from Theorem jlj We now 
follow the proof of Lemma [3] taking S = S° and M = C^* \ 5° C N. The KKT conditions 
become 



with f also satisfying (7.2) and (|7.3|) as before. Now let A be such that 

n 



1 



S M,So£ >g o 1 5 o S S n 3 <? 



< A < min 



It is straightforward to check that 



0S°iPm) = (^50 - AS s0 1 i50 sgn(/3^ ) + ^S S o iS o^|oe, 0) 
(f s o,f M ) = (sgn(/3° ), E^soE^^sgnOSgo) + ^~ l X T M (I ~ P S °)e 



(7.12) 
(7.13) 



is the unique solution to (7.12), (7.13), ( |7.2[ ) and (7.3). The only danger is that we may have 
A > A|* art . However, we know that /3A/(A| tart , Ck* ) = 0. It is easy to check that in this case 
we still have sgn(/3 5 o(A| t « art , Ck*)) = sgn(/3° ), and thus we may take A* = min{A, A| tart }. □ 
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